Abstract

This paper presents a structured ordinal measure method
for video-based face recognition that simultaneously learns
ordinal filters and structured ordinal features. The problem
is posed as a non-convex integer programming problem that
includes two parts. The first part learns stable ordinal
filters to project video data into a large-margin ordinal space.
The second seeks self-correcting and discrete codes by
balancing the projected data and a rank-one ordinal matrix in
a structured low-rank way. Unsupervised and supervised
structures are considered for the ordinal matrix. In addition,
as a complement to hierarchical structures, deep feature
representations are integrated into our method to enhance
coding stability. An alternating minimization method
is employed to handle the discrete and low-rank constraints,
yielding high-quality codes that capture prior structures
well. Experimental results on three commonly used face
video databases show that our method with a simple voting
classifier can achieve state-of-the-art recognition rates
using fewer features and samples.

1. Introduction 

Video-sharing websites are a fast-growing platform that
allows internet users to distribute their video clips. These
websites often host a large number of face videos.
How to index, retrieve, and classify these face videos has
become an active research topic in the area of video-based
face recognition (VFR). Current VFR methods often perform
recognition based on hundreds or thousands of floating
point features, and store almost every face sample from
a video clip. Since there can be (many) thousands of face
samples in a video clip, high-dimensional dense features
and large-scale registered samples result in tremendously
large time and space complexity, which becomes a
computational bottleneck when applying VFR methods to
video-sharing websites.

Recently, binary code representations have drawn much
attention in biometric recognition [5][21][27] and large
scale image retrieval [26][13][24]. Among these binary
coding methods, codes constructed from ordinal measures
(OM) are one representative method. Ordinal measures [31]
are common in human perceptual judgments. It is easy
and natural for humans to rank or order the heights of two
persons, although it is hard to estimate their precise
differences [33]. Ordinal measures were originally used in social
science [31] and then introduced to computer vision.

In biometrics, an OM is defined as the relative ordering
of some property - for example, the average brightness of
two adjacent regions (with 1 coding $A \succ B$ and 0 coding
$A \prec B$) or the relative ordering of two color channels
within the same region. Ordinal filters, which have a number
of tunable parameters, are operators for analyzing the ordinal
measures of image features. The Haar wavelet and quadratic
spline wavelet can be regarded as typical ordinal filters.
Ordinal features are the binary codes of image features
obtained by thresholding the outputs of ordinal filters. Fig. 1
gives a simple illustration of OM.
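
As a minimal sketch of the definition above, the following snippet computes one ordinal bit from the average brightness of two image regions. The rectangular region format, the helper names `region_mean` and `ordinal_bit`, and the toy image are assumptions made for illustration only; they are not taken from any particular OM implementation.

```python
import numpy as np

def region_mean(image, region):
    """Average brightness of a rectangular region (row0, row1, col0, col1)."""
    r0, r1, c0, c1 = region
    return float(image[r0:r1, c0:c1].mean())

def ordinal_bit(image, region_a, region_b):
    """1 codes 'A brighter than B', 0 codes the opposite ordering."""
    return 1 if region_mean(image, region_a) > region_mean(image, region_b) else 0

# Toy 4x4 "image" (hypothetical data): bright left half, dark right half.
image = np.array([[200, 190, 40, 30],
                  [210, 180, 50, 20],
                  [205, 185, 45, 25],
                  [195, 175, 35, 15]], dtype=float)
left, right = (0, 4, 0, 2), (0, 4, 2, 4)

bit = ordinal_bit(image, left, right)
# A global brightness change (here: halving every pixel) keeps the ordering,
# so the ordinal bit is unchanged.
bit_dimmed = ordinal_bit(0.5 * image, left, right)
```

The bit depends only on which region is brighter, not on by how much, which is why such codes tolerate illumination changes that preserve the ordering.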

In prior work, the set of handcrafted ordinal filters is
chosen to correspond to some family of coherent patterns - like
Gabor filters. The space of ordinal filters can therefore be
quite large as the tunable parameters - scale, frequency,
orientation - are varied, each giving rise to a potential ordinal
feature. Different feature selection methods [32][33][40]
have been used for OM to select a stable subset from the
over-complete ordinal features. The term 'stable' indicates
that the floating point features generated by an ordinal filter
from the same class are expected to have large margins so
that the corresponding ordinal features (binary codes) are
robust to intra-class variations during binarization.

Motivated by the success of OM in iris [32], palmprint [33]
and face recognition [5], we present what we refer
to as a structured ordinal measure (SOM) method for
video-to-video face recognition. Different from previous
handcrafted OM methods, SOM simultaneously learns ordinal
filters (SVMs) and structured ordinal features (binary
codes) from video data as shown in Fig. 1. Considering that
face appearances in video clips contain several facial
variations and are similar in adjacent frames, we design the
ordinal features of SOM to be stable and self-correcting binary
codes. Stability indicates that the learned ordinal features
are required to have large margins and to be clustered. The
self-correcting character indicates that the binary code of one
frame depends not only on its corresponding ordinal filter
(or coding function) but also on the binary values of similar
(typically nearby in time) face samples. Because face
images in a video clip often lie in a union of multiple linear
subspaces [7][43], the features (binary codes) assigned to the
subset of faces from a single linear subspace should be similar.
These binary codes can be potentially corrected by each
other through a low-rank constraint on the matrix of
constructed codes. One of the main advantages of our method
is that it simultaneously reduces the number of dense
features and eliminates redundant samples¹.

Figure 1. An illustration of structured ordinal measures. Ordinal measure of visual relationship between two regions [32][33]. Previous
OM methods apply feature selection methods to select from over-complete ordinal features (binary codes) that are generated by handcrafted
ordinal filters. SOM simultaneously seeks ordinal filters and optimal ordinal features in a data-driven way, makes the learned features
low-rank and enforces an optimal ordinal matrix for classification. In SOM, one binary code of a sample can be corrected according to the
codes of similar samples.

We will formulate the SOM problem as a non-convex
integer programming problem that mainly includes two parts. The
first part learns stable ordinal filters to project video data
into a space in which the filtered data are separable with
a maximum margin. This can be viewed as an instance of
maximum margin clustering (MMC) [41]. The second finds
self-correcting binary codes by balancing the projected
real-valued data and a rank-one ordinal matrix in a structured
low-rank way. Unsupervised and supervised structures are
considered for the ordinal matrix. We also integrate CNN
feature representations into our method to enhance stability.
An alternating optimization method provides an efficient
discrete solution to deal with the discrete and low-rank
constraints imposed on binary ordinal features. In addition,
a simple voting classifier with a self-correcting process is
proposed to efficiently compress and classify video clips.
Experimental results on three commonly used face video
databases show that our SOM method can achieve
state-of-the-art recognition results using fewer features and samples.
Compared to previous binary coding methods for still images
(face or iris), SOM more efficiently utilizes the low-rank
property of video data and hence is potentially useful
for VFR problems.

¹ Getting rid of redundant samples is important during both training and
testing. In a video clip, the face can remain unchanging for long periods of
time and that would bias the models towards that appearance.

There are three major contributions of this work: 

1) By employing the optimal ordinal matrices as output
structures, SOM encourages ordinal features from the same
class to have similar binary codes. To the best of our
knowledge, SOM is the first algorithm that learns binary codes (or
hashing) using output structures.

2) Assuming that face images of a video clip lie in
a union of linear subspaces, we propose a self-correcting
method to discretely binarize both gallery and probe videos.
Our method utilizes the continuous information in videos
and hence is effective for VFR tasks.

3) As a by-product of SOM, we show that using a simple
voting classifier improves over competing and complex
classification models on fine-grained datasets like the
YouTube Celebrities dataset and offers an impressive
compression ratio of CNN floating point features (20% face
samples and 64-bit binary codes).

The rest of this paper is organized as follows. We briefly
review some recent advances on binary coding methods in
Section 2. In Section 3 and Section 4, we present the
details of SOM and the optimal ordinal matrices respectively.
Section 5 provides experimental results, followed by a summary
in Section 6.
















2. Related work 

Since OM methods are an instance of binary appearance 
features, we briefly review some recent advances on binary 
coding methods. 

2.1. Biometric recognition 

In biometrics, binary feature representation methods often
generate binary codes by directly applying filters to
local image patches. Local binary patterns (LBP)
and ordinal measures are two representative binary features.
There are many variations of these two features [5][21]. The
definition and properties of OM in the context of biometrics
can be found in [32].

Although OMs have been successfully applied to biometrics,
there are still two open issues for OM. The first issue is
the design of ordinal filters. The existing ordinal filters are
often handcrafted. But handcrafted ordinal filters are too
simple to represent complex human vision structures [23].
In addition, to improve stability and accuracy, these filters
often contain a large number of parameters based on
distance, scale and location, resulting in a large set of
potential ordinal features. This naturally leads to the second
issue, i.e., how to select the optimal set of ordinal features.
Although various feature selection methods [32][40][33] have
been employed to improve selection results, it is still difficult
for a feature selection algorithm to select the optimal set
from the over-complete set of OM.

Recently, data-driven binary feature methods, which
learn local image filters from data, have drawn much
attention. Cao et al. [4] utilized unsupervised methods
(random-projection trees and PCA trees) to learn binary
representations. Lei et al. [21] proposed an LBP-like discriminant
face descriptor (DFD) by combining image filtering, pattern
sampling and encoding. Chan et al. [6] combined cascaded
PCA, binary code learning and block-wise histograms
to learn a deep network. Lu et al. [27] proposed a compact
binary face descriptor (CBFD) to remove the redundant
information of face images. Although these methods
indeed boost recognition performance on some challenging
databases, their learned features are often high dimensional.
For example, the dimensionalities of the histogram feature
vectors of DFD and CBFD are 50,176 and 32,000 respectively.
High dimensional and dense representations make these
data-driven methods impractical for VFR problems.

2.2. Image retrieval 

Learning binary codes ('hashing') has been a key
step to facilitate large-scale image retrieval. In image
retrieval, the terminology 'hashing' refers to learning
compact binary codes with Hamming distance computation.
Similarity-sensitive hashing or locality-sensitive hashing
algorithms [38][19], graph-based hashing [25],
semi-supervised learning [34], support vector machines [28][30],
Riemannian manifolds [22], decision trees [24] and
deep learning [13][39] have been studied to map
high-dimensional data into a low-dimensional Hamming space.
The authors in [25][30] argued that the degraded
performance of hashing methods is due to the optimization
procedures used to achieve discrete binary codes. Hence they
tried to enforce binary constraints to directly obtain discrete
codes. A brief review of hashing methods for image
search can be found in [13][35].

These hashing methods are often used for image search
and retrieval but they may not achieve the highest accuracy
for VFR problems. For example, the constraints in [25]
maximize the information from each binary code over all
the samples in a training set. However, adjacent face samples
in a video clip often have nearly the same appearance
so that these samples can have similar binary codes. In
addition, to the best of our knowledge, there are no existing
hashing methods that address image-set problems [8].

3. Structured ordinal measures (SOM) 

3.1. Motivation 

Consider a training set $X$ from $C$ classes, which consists
of $n$ biometric samples $x_j$ ($1 \le j \le n$) in a high
dimensional Euclidean space $\mathbb{R}^d$. The goal of previous OM
methods is to identify ordinal filters over $X$ to nonlinearly map
each $x_j$ to $m$ ordinal features (an $m$-bit binary code). Since
ordinal filters typically have a number of tunable parameters
and so determine a huge set of possible ordinal features,
various feature selection methods have been used to select
the $m$ ordinal features. The selected ordinal features of all
samples form a binary matrix $B = [b_1, \ldots, b_n] \in \{-1,1\}^{m \times n}$,
referred to as an ordinal matrix. Previous OM methods
select ordinal filters one by one (using a greedy approach) and
hence neglect the output structure of ordinal features, for
example the fact that video data are often low-rank.

In biometrics, since intra-class variations of biometric
samples are often very large, good ordinal measures should
generate similar binary codes for the samples from one subject.
In addition, a large difference between two quantities
will result in more stable binary features. For example, the
greater the color difference between two image regions, the
more easily humans order their relative brightness (1 or 0);
and the greater the height difference between two persons,
the more easily humans rank their relative heights.

To obtain stable ordinal features, we introduce the
following minimization problem for OM,

$$\min_{W,\xi,B} \; \mu\xi + \lambda_1 \|W\|_2 + \sum_c \|B^c\|_* \qquad (1)$$

$$\text{s.t.} \quad B_{ij}(w_i^T x_j) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0, \quad B_{ij} \in \{-1, 1\}$$

where $\mu$ and $\lambda_1$ are constants, and $\|\cdot\|_*$ denotes the matrix
trace norm (i.e., the sum of its singular values). $B^c$
represents all ordinal features from the $c$-th class. The parameter
matrix $W = [w_1, \ldots, w_m] \in \mathbb{R}^{d \times m}$ represents a set of
ordinal filters. As defined in Section 2, a parameter matrix
$W$ contains a set of ordinal filters only if $W$ can result in
consistent orders for the samples from the same class, e.g.,
$W^T X$ generates an ordinal matrix as in Fig. 2. In contrast
to the binary coding methods [32][21][27] that are based on
local image patches, (1) directly uses the whole image as an
input to find compact codes². More importantly, (1) aims to
simultaneously seek ordinal filters ($W$) and optimal ordinal
features ($B$).

The low-rank constraint in (1) encourages the ordinal
features from the same class to be correlated. This
constraint reduces the redundancy of video data and corrects
some binary codes whose corresponding values ($W^T X$) are
close to the SVMs' separating hyperplanes. We also want to
enforce that the learned $B$ is close to the optimal ordinal
(binary) matrix for classification, resulting in the following
minimization problem,

$$\min_{W,\xi,B} \; \mu\xi + \lambda_1 \|W\|_2 + \sum_c \|B^c\|_* + \lambda_2 \|B - S\|_F^2 \qquad (2)$$

$$\text{s.t.} \quad B_{ij}(w_i^T x_j) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0, \quad B_{ij} \in \{-1, 1\}$$

where $S \in \mathbb{R}^{m \times n}$ is a prior ordinal matrix that defines a
desired output structure for ordinal features. We postpone
discussion of the design of $S$ until Section 4. Since the
OM problem in (2) imposes an output structure on ordinal
filter learning, we refer to the problem in (2) as learning a
structured ordinal measure.

Problem (2) is difficult to solve. Unlike a supervised SVM,
which can be formulated as a convex optimization problem, (2)
remains a non-convex integer optimization problem even
without the structured low-rank constraint; it is an instance of
maximum margin clustering [41]. To simplify the
minimization of (2), we relax (2) by introducing an equality
constraint on $B$ as follows,

$$\min_{W,\xi,B,E} \; \mu\xi + \lambda_1 \|W\|_2 + \sum_c \|B^c\|_* + \lambda_2 \|B - S\|_F^2 + \|E\|_F^2$$

$$\text{s.t.} \quad B = W^T X + E, \quad B_{ij} \in \{-1, 1\}, \qquad (3)$$

$$B_{ij}(w_i^T x_j) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0$$

where $E \in \mathbb{R}^{m \times n}$ is an error term to reduce the loss
during binarization. Since $\|B - S\|_F^2 = \sum_c \|B^c - S^c\|_F^2$, (3)
actually seeks discrete binary codes by balancing floating
point data $W^T X$ and a rank-one ordinal matrix $S^c$ in a
structured low-rank way.

² In face recognition, dividing a face image into small patches can
capture nonlinear facial variations well and so improves recognition rates. The
learned filters in (1) can also be applied to local patches as in previous
binary coding methods.


Our SOM formulation in (3) has two major advantages:

1) The introduction of the low-rank constraint and error term
makes SOM more flexible during binarization. The learned
binary codes depend on their corresponding floating point
values as well as prior structures. Different from the binary
codes that are directly generated by ordinal filters or hashing
functions, the binary codes of SOM can be self-corrected by
the structure constraints, resulting in self-correcting codes.

2) Since $S^c$ is a rank-one matrix, $\lambda_2$ plays the role of
controlling the number of learning samples. The rank-one
matrix indicates that there is only one unique sample in this
matrix. The larger the value of $\lambda_2$, the more $B^c$
resembles $S^c$. In practice, the rank of $B^c$ will be larger than one
because a face video clip often contains several face
variations.

3.2. Optimization 

The optimization problem in (3) is a hard computational
problem (non-convex integer optimization), which belongs
to the class of maximum margin clustering problems [41].
Fortunately, we do not need to find the global minimum
because local minima produce good ordinal features. Hence
we can decompose the non-convex problem in (3) into
subproblems as in MMC. A local minimum can be obtained
by solving a series of SVM training and binary code learning
problems. An overview of our iterative algorithm is as
follows.

First, fixing variables $B$ and $E$, we minimize (3) w.r.t.
variables $W$ and $\xi$, resulting in the multiple linear SVM
problem in (4) (one SVM for each ordinal feature) [10]³. To learn the
$i$-th SVM, the columns of $X$ and the elements of the $i$-th
row of $B$ are used as training data and labels respectively.

$$\min_{W,\xi} \; \mu\xi + \lambda_1 \|W\|_2 \qquad (4)$$

$$\text{s.t.} \quad B_{ij}(w_i^T x_j) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0$$

Second, fixing variables $W$ and $\xi$, (3) takes the following
form w.r.t. $B$ and $E$,

$$\min_{B,E} \; \sum_c \|B^c\|_* + \lambda_2 \|B - S\|_F^2 + \|E\|_F^2 \qquad (5)$$

$$\text{s.t.} \quad B = A + E, \quad B_{ij} \in \{-1, 1\}$$

where $A = W^T X$. By substituting the equality constraint
into the objective function of (5), we can reformulate
(5) as follows,

$$\min_{B} \; \|A - B\|_F^2 + \sum_c \|B^c\|_* + \lambda_2 \|B - S\|_F^2 \qquad (6)$$

$$\text{s.t.} \quad B_{ij} \in \{-1, 1\}$$

Since $\|\cdot\|_F^2$ is separable, the solution of (6) can be
independently obtained by minimizing the following subproblem
for each class $c$,

$$\min_{B^c} \; \|A^c - B^c\|_F^2 + \|B^c\|_* + \lambda_2 \|B^c - S^c\|_F^2 \qquad (7)$$

$$\text{s.t.} \quad B^c_{ij} \in \{-1, 1\}$$

³ The $\ell_1$-regularized linear SVM is implemented by LIBLINEAR:
http://www.csie.ntu.edu.tw/~cjlin/libsvm

To minimize the low-rank problem in (7), we first need to
introduce a variational formulation for the trace norm [14],

Lemma 1 Let $B \in \mathbb{R}^{m \times n}$. The trace norm of $B$ is equal
to:

$$\|B\|_* = \frac{1}{2} \inf_{L \succeq 0} \left[ \mathrm{tr}(B^T L^{-1} B) + \mathrm{tr}(L) \right] \qquad (8)$$

and the infimum is attained for $L = (BB^T)^{1/2}$.
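
The lemma can be checked numerically. The sketch below assumes a random full-row-rank $B$ (so that $BB^T$ is invertible) and compares the sum of singular values with the variational expression evaluated at $L = (BB^T)^{1/2}$ and at a perturbed $L$. The variable names and the test matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 10
B = rng.standard_normal((m, n))   # full row rank with probability 1

# Trace (nuclear) norm: sum of singular values.
trace_norm = np.linalg.svd(B, compute_uv=False).sum()

# L = (B B^T)^{1/2} via the eigendecomposition of the symmetric matrix B B^T.
evals, evecs = np.linalg.eigh(B @ B.T)
L_opt = (evecs * np.sqrt(evals)) @ evecs.T

def variational_value(B, L):
    """0.5 * [tr(B^T L^{-1} B) + tr(L)] for a symmetric positive definite L."""
    return 0.5 * (np.trace(B.T @ np.linalg.solve(L, B)) + np.trace(L))

value_at_opt = variational_value(B, L_opt)                       # equals trace_norm
value_elsewhere = variational_value(B, L_opt + 0.5 * np.eye(m))  # strictly larger
```

Any other positive definite $L$ gives a larger value, which is what makes alternating between $L$ and $B$ a sensible way to handle the trace norm.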

Using this lemma, and absorbing the constant factor $1/2$ into
the weight of the trace-norm term, we can reformulate (7) as,

$$\min_{B^c} \min_{L} \; \|A^c - B^c\|_F^2 + \mathrm{tr}\left((B^c)^T L^{-1} B^c\right) \qquad (9)$$

$$+ \lambda_2 \|B^c - S^c\|_F^2 + \mathrm{tr}(L) \quad \text{s.t.} \quad B^c_{ij} \in \{-1, 1\}$$

The problem in (9) can be alternately minimized. When $L$
is fixed, we can use the discrete cyclic coordinate descent
method to obtain $B^c$ bit by bit. For simplicity, we instead use
a simple and direct method to find $B^c$: disregarding
the integer constraint and setting the derivative of (9) w.r.t.
$B^c$ equal to zero gives the solution

$$B^c = \left((1 + \lambda_2) I + L^{-1}\right) \backslash \left(A^c + \lambda_2 S^c\right), \qquad (10)$$

where $M \backslash N$ denotes $M^{-1} N$. Given a floating point $B^c$ in
one iteration, we can use the sign function $\mathrm{sgn}(\cdot)$ to obtain the
binary-valued $\mathrm{sgn}(B^c)$. Experimental results show that the
learned binary codes are good enough for VFR. Algorithm 1
summarizes the procedure to learn structured ordinal filters.
$\lambda_2$ is set to 0.1 throughout this paper.
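
The alternation between $L$, the closed form (10) and the sign function can be sketched for a single class as follows. This is an illustrative sketch rather than the authors' implementation. The small ridge `eps` added to $L$ (so that $L^{-1}$ exists when the codes are rank-deficient), the stopping rule, the tie-breaking of `sgn(0)` to +1, and the toy data are all assumptions.

```python
import numpy as np

def psd_sqrt(M, eps=1e-6):
    """Symmetric square root of a PSD matrix, plus eps*I so it is invertible."""
    evals, evecs = np.linalg.eigh(M)
    evals = np.clip(evals, 0.0, None)
    return (evecs * np.sqrt(evals)) @ evecs.T + eps * np.eye(M.shape[0])

def binarize_class(A_c, S_c, lam2=0.1, max_iter=20):
    """Alternate L = (B B^T)^{1/2}, the closed form (10), and sgn(.)."""
    m = A_c.shape[0]
    B = np.where(A_c >= 0, 1.0, -1.0)            # initial codes: plain sgn(A^c)
    for _ in range(max_iter):
        L = psd_sqrt(B @ B.T)
        lhs = (1.0 + lam2) * np.eye(m) + np.linalg.inv(L)
        B_real = np.linalg.solve(lhs, A_c + lam2 * S_c)   # equation (10)
        B_new = np.where(B_real >= 0, 1.0, -1.0)          # sgn, ties mapped to +1
        if np.array_equal(B_new, B):
            break
        B = B_new
    return B

# Toy class with 6 frames and 4 bits; S^c is rank-one (every column equals s).
s = np.array([1.0, -1.0, 1.0, -1.0])
S_c = np.tile(s[:, None], (1, 6))
A_c = 2.0 * S_c
A_c[0, 3] = -0.2      # one filter response falls on the wrong side of zero

B_c = binarize_class(A_c, S_c)
```

In this toy case plain thresholding of `A_c` would give frame 3 a code that differs from its neighbours in the first bit, whereas the low-rank coupling pulls that bit back so that `B_c` equals `S_c`.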


Algorithm 1: Learning structured ordinal filters

Input: Data matrix $X \in \mathbb{R}^{d \times n}$ and ordinal matrix $S \in \mathbb{R}^{m \times n}$

Output: Ordinal filters $W \in \mathbb{R}^{d \times m}$

1: repeat
2:   Train $m$ linear SVMs to update $W$ using $B^{t-1}$ as training labels.
3:   Compute $A = W^T X$.
4:   repeat
5:     Compute $L = (B^c (B^c)^T)^{1/2}$.
6:     Compute $B^c$ via (10).
7:     Let $B^c = \mathrm{sgn}(B^c)$.
8:   until the variation of $B$ is smaller than a threshold.
9:   $t = t + 1$.
10: until the variation of $B$ is smaller than a threshold.


3.3. Classification 

When applying SOM (or binary code learning methods)
to biometric recognition, SOM must generate ordinal features
for any data sample beyond the sample points in the
training set $X$. Given a new probe dataset $X^p$, a hashing
algorithm $H$ with parameter $W$ typically applies the sign
function $\mathrm{sgn}(\cdot)$ to the hashing function $f_W(X^p)$ to obtain
the binary codes [25][30], i.e., $B^p = \mathrm{sgn}(f_W(X^p))$.

VFR can be viewed as an image-set classification/retrieval
problem [8]. The samples in a probe (or gallery) dataset are
from a video clip and so have a low-rank structure. Hence,
instead of using the sign function, we propose a low-rank
method to construct the binary codes for a probe video as
follows,

$$\min_{B,E} \; \|E\|_F^2 + \|B\|_* \qquad (11)$$

$$\text{s.t.} \quad B = f_W(X^p) + E, \quad B_{ij} \in \{-1, 1\}$$

Compared to directly using the sign function $\mathrm{sgn}(\cdot)$ to
obtain binary codes, (11) utilizes a low-rank prior to find
binary codes. As a result, the binary codes $B$ do not depend
only on the function $f_W(\cdot)$: the values in $B$ can potentially be
changed (or corrected) by each other due to the low-rank
constraint. (11) is a sub-problem of (7) when $\lambda_2$ is set to
zero. Hence (11) can be alternately minimized in the same way as (7).

Given the binary codes constructed from (11), a simple
nearest neighbor classifier with voting is applied to each
unique code in $B$ (many samples can be mapped to the same
code by the optimization) to report recognition rates.
The class label of the majority class in a video
sequence is taken as the final class label of this sequence.
In addition, since the low-rank constraint in (11) tends to
make the column samples in $B$ correlated, it also tends to
reduce the number of different samples in $B$. We introduce
the term compression ratio of samples for VFR, i.e.,
compression ratio = the number of unique samples / the total
number of samples. A lower compression ratio of an algorithm
indicates that the algorithm needs less storage space
(and as a consequence less computational time).
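
The classifier and the compression ratio described above can be sketched as follows. The function names, the choice to give each unique probe code exactly one vote, and the toy gallery are assumptions made for illustration; the codes are stored as one column per frame.

```python
import numpy as np
from collections import Counter

def unique_codes(B):
    """Distinct columns of an m x n code matrix (one column per frame)."""
    return np.unique(B, axis=1)

def compression_ratio(B):
    """Number of unique samples divided by the total number of samples."""
    return unique_codes(B).shape[1] / B.shape[1]

def classify_video(B_probe, B_gallery, gallery_labels):
    """Nearest-neighbour label for each unique probe code, then majority vote."""
    votes = []
    for code in unique_codes(B_probe).T:
        hamming = np.count_nonzero(B_gallery != code[:, None], axis=0)
        votes.append(gallery_labels[int(np.argmin(hamming))])
    return Counter(votes).most_common(1)[0][0]

# Hypothetical gallery with one 4-bit code per class.
B_gallery = np.array([[1, -1], [1, -1], [-1, 1], [-1, 1]])
gallery_labels = [0, 1]

# Probe video with 5 frames but only 3 distinct codes.
B_probe = np.array([[ 1,  1,  1,  1, -1],
                    [ 1,  1,  1,  1, -1],
                    [-1, -1, -1, -1,  1],
                    [-1, -1, -1,  1,  1]])

predicted = classify_video(B_probe, B_gallery, gallery_labels)
ratio = compression_ratio(B_probe)
```

Here two of the three unique codes are closest to class 0, so the video is assigned to class 0, and only 3 of the 5 frames need to be stored or compared.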

In addition, since there is no rank-one constraint in (11)
(compared to (2)), the compression ratio will tend to be high as
the number of desired bits increases. If a prior on the
rank of a video clip is given or a lower compression ratio
is required, we can further impose a rank constraint on (11),
resulting in the following minimization problem,

$$\min_{B} \; \|f_W(X^p) - B\|_F^2 \qquad (12)$$

$$\text{s.t.} \quad \mathrm{rank}(B) \le r, \quad B_{ij} \in \{-1, 1\}$$

where $\mathrm{rank}(\cdot)$ is the matrix rank operator and $r$ is a
constant. The rank constraint in (12) forces the rank of $B$ to be
at most $r$. That is, all binary samples can be linearly
represented by $r$ binary vectors. As a result, the number of
unique samples is potentially related to $r$.

4. Ordinal matrices for classification 

In this section, we discuss the design of the optimal ordinal
matrices in (2). Then we discuss combining deep feature
representation to improve the stability of SOM.

4.1. The optimal ordinal matrix 

We begin the study of the optimal ordinal matrix $S$ for
(2) with a two-class problem. We expect that all intra-class
and inter-class sample pairs of binary codes are well
separated with a large margin, i.e.,

$$J(B) = \frac{1}{n_1} \sum_{c_i \ne c_j} \|b_i - b_j\|_0 - \frac{1}{n_2} \sum_{c_i = c_j} \|b_i - b_j\|_0 \qquad (13)$$

where $B = [b_1, \ldots, b_n] \in \{0,1\}^{m \times n}$ is a binary matrix, $n_1$
and $n_2$ are the numbers of extra-class and intra-class pairs
respectively, and $\|\cdot\|_0$ is the counting norm (i.e., the number
of nonzero entries in a vector or matrix). Each $b_i$
is the binary code of one data item. The first
term of (13) rewards items from different classes having
large Hamming distance, while the second term penalizes
items from the same class having large Hamming distance.
The maximization of $J(B)$ is NP-hard. By analyzing $J(B)$,
we make the following two observations on its optimal
solution.

Proposition 1 The maximum value of $J(B)$ is equal to the
number of bits ($m$), i.e., $\max_B J(B) = m$.

Proof. According to the definition of the $\ell_0$ norm, we can
easily derive that $\max_B J(B) \le m$. In addition, when $B$
satisfies,

a) for all $i, j, k$ with $c_i \ne c_j$, $b_{ik} \ne b_{jk}$, so that
$\frac{1}{n_1} \sum_{c_i \ne c_j} \|b_i - b_j\|_0 = m$;

b) for all $i, j, k$ with $c_i = c_j$, $b_{ik} = b_{jk}$, so that
$\frac{1}{n_2} \sum_{c_i = c_j} \|b_i - b_j\|_0 = 0$,

we obtain $J(B) = m$ (Fig. 2 (a) gives an example of $B$).
Hence $\max_B J(B) = m$.

Proposition 2 If there exists a $B$ such that $J(B) = m$, then
$B$ satisfies the following two conditions. (a) All the samples
in each class have a unique binary code. (b) The sample
code of one class is orthogonal to that of the other class.

Proof. If $c_i = c_j$ and $b_{ik} \ne b_{jk}$, then
$\frac{1}{n_2} \sum_{c_i = c_j} \|b_i - b_j\|_0 > 0$
so that $J(B) < m$. Since $b_{ik} \in \{0,1\}$ and $\|b_i - b_j\|_0 = m$
for $c_i \ne c_j$, $b_i^T b_j = 0$. Hence $b_i$ is orthogonal to $b_j$ when
$c_i \ne c_j$ and $J(B) = m$.
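
The two propositions can be illustrated on a small two-class example. The sketch below evaluates $J(B)$ in (13) over unordered sample pairs, which is an assumption about how the pairs are counted, using $\{0,1\}$ codes stored as columns. The helper name `objective_J` and the toy codes are illustrative.

```python
import numpy as np
from itertools import combinations

def objective_J(B, labels):
    """J(B) in (13): mean inter-class minus mean intra-class Hamming distance."""
    inter, intra = [], []
    for i, j in combinations(range(B.shape[1]), 2):
        dist = np.count_nonzero(B[:, i] != B[:, j])   # l0 norm of b_i - b_j
        (intra if labels[i] == labels[j] else inter).append(dist)
    return np.mean(inter) - np.mean(intra)

m = 4
labels = [0, 0, 0, 1, 1, 1]
code_a = np.array([1, 0, 1, 0])
code_b = 1 - code_a                     # complementary code, orthogonal to code_a
B_opt = np.column_stack([code_a] * 3 + [code_b] * 3)
J_opt = objective_J(B_opt, labels)      # attains the bound m

B_noisy = B_opt.copy()
B_noisy[0, 0] = 1 - B_noisy[0, 0]       # flip one bit of one sample
J_noisy = objective_J(B_noisy, labels)  # falls below m
```

A single flipped bit both lowers the inter-class term and raises the intra-class term, so the bound $m$ is lost as soon as the codes within a class stop being identical.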
Figure 2. Three types of the optimal ordinal matrices. (a) The optimal
ordinal matrix for a two-class problem. (b) Unsupervised ordinal
matrix constructed via appearance information. Binary codes
of all samples from the same class are arbitrary but unique and
identical. (c) Supervised ordinal matrix via the spectral matrix of
linear discriminant analysis [3].


From Propositions 1 and 2, we can easily obtain the
optimal ordinal matrix for a two-class problem as shown
in Fig. 2 (a). Previous ordinal feature selection methods
[32][33] actually select ordinal filters one by one so that
the selected filters generate codes like those in Fig. 2 (a). When
there are multiple classes, the problem of determining the
optimal binary codes becomes complex. Inspired by
Propositions 1 and 2, we consider two types of ordinal matrices
to approximate the optimal ordinal matrix (shown in Fig. 2
(b)-(c)).

For the unsupervised ordinal matrix, we just require that
the binary codes of each class be unique. There are many
ways to generate informative binary codes for this case, e.g.,
random binary codes and Hadamard codes [16]. Since
ordinal filters perform learning based on human face
appearances, we also expect that the unsupervised ordinal matrix
would capture useful appearance information of video data.
To accomplish this, we apply the unsupervised version of
Iterative Quantization (PCA-ITQ) [12] to the mean faces
of each class to generate the corresponding unique binary
code for each class. Then, the unsupervised ordinal matrix
contains appearance information while the binary codes of
different classes are largely uncorrelated.

For the supervised ordinal matrix, we simply employ
the spectral matrix of linear discriminant analysis [3] (the
regression target of multi-class linear regression). In this
spectral matrix, the binary codes of the samples from any
one class have just one bit set, which defines the orders of a
class. Since this spectral matrix contains discriminative
information, the ordinal matrix will contain supervised
information if this spectral matrix is used as the ordinal matrix.
However, the code length of this spectral matrix can be only
$C$. If code lengths larger than $C$ are needed, we can obtain
longer binary codes by combining the spectral matrix with
the unsupervised ordinal matrix.
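
A minimal sketch of the supervised ordinal matrix is given below. It assumes the $\{-1, +1\}$ coding of Section 3, so "one bit set" becomes a single $+1$ per column. For longer codes it appends one random code per class as a stand-in for the unsupervised matrix; PCA-ITQ on the class mean faces, as described above, is not implemented here, and random codes are not guaranteed to be distinct across classes. The function names are hypothetical.

```python
import numpy as np

def supervised_ordinal_matrix(labels, num_classes):
    """C x n matrix in {-1, +1}: sample j has +1 only in the row of its class."""
    labels = np.asarray(labels)
    S = -np.ones((num_classes, labels.size), dtype=int)
    S[labels, np.arange(labels.size)] = 1
    return S

def extend_with_class_codes(S_sup, labels, extra_bits, seed=0):
    """Append extra_bits rows by giving every class one random +/-1 code."""
    rng = np.random.default_rng(seed)
    class_codes = rng.choice([-1, 1], size=(extra_bits, S_sup.shape[0]))
    return np.vstack([S_sup, class_codes[:, np.asarray(labels)]])

labels = [0, 0, 1, 2, 2]                 # class index of each of n = 5 samples
S = supervised_ordinal_matrix(labels, num_classes=3)
S_long = extend_with_class_codes(S, labels, extra_bits=5)
```

Because every sample of a class receives the same column, each per-class block $S^c$ is rank-one, which is the property the formulation in (3) relies on.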












4.2. Deep Feature Representations 

Since there are large variations of intra-class samples
in uncontrolled VFR environments, it is often difficult to
use one type of local appearance feature to obtain
satisfactory recognition results. Hence, biometric researchers
often combine several local features to improve generalization
ability and recognition performance. In [44], Gabor and
LBP were combined to enhance the representation power
of the spatial histogram. In [5], Gabor ordinal measures
were proposed to improve distinctiveness of Gabor features
and robustness of OMs. In [21][6], different techniques are
combined together to achieve state-of-the-art results.

Inspired by the success of the combination of several
appearance features, we couple SOM with deeply learned
features from convolutional neural networks (CNN) [9] to
improve coding stability. Benefiting from their deep
architecture and supervised learning approach [2], CNNs can
efficiently deal with large amounts of data and generate a
hierarchical and discriminative feature representation. The
use of deeply learned features makes the learned ordinal
features contain not only the prior structure from data but
also the hierarchical structure of local image patches.

The CNN implemented by Alex⁴ is used as our
deep architecture. This CNN first feeds gray scale images
to two convolutional layers, each followed by a
normalization layer and a max-pooling layer. Then, two locally
connected layers are connected to the output of the second
max-pooling layer, and finally to a $C$-way soft-max
regression layer ($C$ is the number of classes) that produces a
distribution over class labels. The inputs to this network are
the cropped gray scale face images without any
preprocessing. The last $C$-way soft-max regression layer provides
supervised information for learning face representations. The
outputs of the last locally connected layer are employed as
deep feature representations.

5. Experiments 

In video-sharing websites, there are a large number of
face videos, each of which contains hundreds of face images.
Using binary features to represent these face images
will significantly save computational power and storage
space. Hence, VFR is a good test platform to evaluate
SOM. All experiments are run 10 times by repeating
the random selection of training/testing sets. For all binary
code methods, the simple nearest neighbor classifier for
each unique code in the probe set with voting is used as
a classifier to report recognition rates.

5.1. Methods 

We systematically compare SOM with popular techniques
from three categories. SOM1 and SOM2 indicate
Algorithm 1 using the last two structures from Fig. 2 (b)-(c)
respectively. For SOM2, the bits from the optimal matrix
for SOM1 are appended to those for SOM2, as discussed in
Section 4, if the code length is larger than the number of classes.

⁴ https://code.google.com/p/cuda-convnet/

For the first category, we compare SOM with state-of- 
the-art data-driven binary feature methods in biometrics, 
including discriminant face descriptor (DFD) [21], Gabor 
ordinal measures (GOM) [5], and compact binary face de¬ 
scriptor (CBFD) [27]. As in [27], cosine distance is used 
for the three methods to achieve their best recognition accu¬ 
racy. Since the feature dimensions of DFD and CBFD are 
too high, whitened PCA (WPCA) is applied to reduce their 
feature dimensions to 1000 [27]. 

For the second category, we compare SOM with pop¬ 
ular hashing methods, including locality sensitive hashing 
(LSH) [11], iterative quantization (ITQ) [12], kernel-based 
supervised hashing (KSH) [26], fast supervised hashing 
(FastH) [24], and supervised discrete hashing (SDH) [30]. 
For ITQ, its supervised version (CCA-ITQ) and unsuper¬ 
vised version (PCA-ITQ) are included. PCA is used as a 
preprocessing step for CCA-ITQ. For SDH, we use the no¬ 
tation SDH-n to indicate that SDH uses image pixels rather 
than nonlinear RBF kernel mapping as its input. Hamming 
distance is computed on each pair of face samples in train¬ 
ing/testing sets. 

For the last category, we compare SOM with pop¬ 
ular VFR methods, including discriminative canonical 
correlations (DCC) [18], manifold discriminant anal¬ 
ysis (MDA) [37], sparse approximated nearest point 
(SANP) [1], sparse representation for video (SRV) and 
its kernelized version KSRV [7], covariance discrimina¬ 
tive learning (Cov-tPLS) [36], jointly learning dictionary 
and subspace structure (JLDSS) [43], image sets alignment 
(ImgSets) [8], regularized nearest points (RNP) [42], and 
mean sequence sparse representation-based classification 
(MSSRC) [29]. As in [42][29][7][43], we directly cite the
best recognition rates of these methods from the literature. 

5.2. Databases 

Three commonly used face video datasets are used to
evaluate the different methods:

The Honda/UCSD dataset [20] is composed of 59 
video sequences of 20 subjects. The sequences of each sub¬ 
ject contain pose and expression variations. The lengths of 
the sequences vary from 12 to 645. Fig. 3 (a) shows cropped 
images from this dataset. We follow the standard train¬ 
ing/testing configuration in [37][1][36][43]: 20 sequences 
are used for training and the remaining 39 sequences for 
testing. All video frames are used to report classification 
results. Since there are only 39 testing sequences, the
recognition rate improves by 2.6% (1/39 × 100%)
when one additional sequence is correctly classified.




Figure 3. Cropped facial images of three different subjects in the three video databases: (a) the Honda/UCSD dataset, (b) the CMU Mobo dataset, (c) the YouTube Celebrities dataset.



Figure 4. Recognition rates and compression ratios of SOM under different parameter settings. (a) Recognition rates as a function of $\lambda_2$.
(b) Compression ratios of samples as a function of $\lambda_2$. (c) Average recognition rates with or without (11). SOM-n indicates the SOM
method without using (11). (d) Average compression ratios of samples with or without (11).


The Mobo (Motion of Body) dataset [15] was originally
published for human pose identification. It contains
96 sequences of 24 different subjects walking on a treadmill.
Each subject has four video sequences corresponding
to four walking patterns respectively. These patterns (slow,
fast, inclined, and carrying a ball) were captured using
multiple cameras. Fig. 3 (b) shows some cropped images from
three subjects. We follow the standard training/testing
configuration in [37][1][36][43]: one video is randomly
chosen for training and the remaining three for testing. The
recognition rate improves by 1.4% (1/72 × 100%) if
one additional video sequence is correctly classified.
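
The step sizes quoted for the Honda/UCSD and Mobo protocols follow directly from the number of test sequences (39 for Honda/UCSD; 24 subjects × 3 test videos = 72 for Mobo). The short sketch below only reproduces that arithmetic; the function name `rate_step` is hypothetical.

```python
def rate_step(num_test_sequences: int) -> float:
    """Recognition-rate increase (in percent) from one more correct test sequence."""
    return 100.0 / num_test_sequences


# Test-set sizes taken from the dataset descriptions above.
for name, n in [("Honda/UCSD", 39), ("CMU Mobo", 72)]:
    print(f"{name}: {rate_step(n):.1f}% per sequence")
```

These steps are the smallest differences in recognition rate that a single run can show on either dataset, which is worth keeping in mind when comparing methods that differ by a few tenths of a percent on average.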

The YouTube Celebrities dataset [17] contains 1910 
video clips of 47 human subjects (actors, actresses, and 
politicians) from the YouTube website. Roughly 41 clips 
were segmented from 3 unique videos for each person. 
These clips are mostly low resolution and highly com¬ 
pressed. Each facial image is cropped to size 30 x 30 as 
shown in Fig. 3 (c). This dataset is challenging because it 
contains large facial variations (e.g., pose, illumination and 
expressions) and tracking errors in the cropped faces. Fol¬ 
lowing the standard setup, the testing dataset is composed 
of 6 test clips, 2 from each unique video, per person. The 
remaining clips were used as the input to the CNN to learn 
a 1152-D feature representation. One frame of video (one 
single image) is fed into the CNN at a time. We randomly 
selected 3 training clips, 1 from each unique video. 

5.3. Algorithmic Analysis 

Since our SOM method consists of several parts that
improve performance, we investigate the effectiveness of each
part on the YouTube Celebrities dataset. To simplify
parameter setting, we directly use the default settings of
$\mu$ and $\lambda_1$ in the LIBLINEAR SVM source code. Hence
there is only one parameter, $\lambda_2$, which controls the
effect of the output structures.

Fig. 4 (a) and (b) show recognition rates and compression
ratios of samples as a function of $\lambda_2$, respectively.
Experimental results are from a single run. The lower the
compression ratio of an algorithm, the better the algorithm.
We observe that $\lambda_2$ affects both recognition
rates and compression ratios. When $\lambda_2$ is large, the
output structure term $\|B - S\|$ dominates (5). If $\lambda_2$
is sufficiently large, the optimal solution of B will equal
the ordinal matrix S, which amounts to directly using S as
the class labels of the SVM to perform binary code learning.
When $\lambda_2$ tends to zero, (5) becomes maximum margin
clustering [41]. That is, we seek a global ordinal filter
matrix W to group the samples from the same class into
several clusters.

Since S is a rank-one matrix, B will be a rank-one matrix
if B is equal to S. In VFR problems, a video clip often
contains many face variations, so it is difficult to use one
binary vector to represent all of them. From Fig. 4
(b), we also observe that the rank of the learned B is larger
than 1. Hence, to keep the diversity of the learned B, it is
not a good strategy to directly use S as the class labels of
the SVM or to set $\lambda_2$ to a large value, although a
larger $\lambda_2$ will result in better compression. Meanwhile,
setting $\lambda_2$ too small will also damage performance. If
$\lambda_2$ tends to zero, there will be no structure constraints
to ensure that the learned ordinal features are similar to
the optimal ordinal matrix for classification. Hence, the
performance of SOM will decrease in terms of both recognition
rates and compression ratios.

Figure 5. Recognition rates of different binary code learning methods.

Figure 6. Compression ratios of different binary code learning methods on the three testing sets. Compression ratio = the number of unique
samples / the total number of samples. The lower the compression ratio of an algorithm, the better the algorithm.

Figure 7. Compression ratios of different binary code learning methods on the three training sets.
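
The compression ratio used throughout this section (number of unique samples divided by the total number of samples, as stated in the caption of Figure 6) can be computed with a few lines of code. The sketch below is illustrative only; the function name `compression_ratio` and the toy codes are hypothetical.

```python
def compression_ratio(codes) -> float:
    """Number of unique binary codes divided by the total number of samples."""
    codes = [tuple(c) for c in codes]  # make each code hashable
    if not codes:
        raise ValueError("need at least one code")
    return len(set(codes)) / len(codes)


# Six frames of one clip that collapse onto three distinct 4-bit codes.
clip_codes = [
    (1, 0, 1, 1),
    (1, 0, 1, 1),
    (0, 1, 0, 0),
    (1, 0, 1, 1),
    (0, 1, 0, 0),
    (1, 1, 1, 0),
]
print(f"{compression_ratio(clip_codes):.0%}")
```

A ratio of 100% means every sample received a distinct code, so nothing can be discarded; a lower ratio means fewer registered samples need to be stored and matched.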

Fig. 4 (c) and (d) show recognition rates and compression
ratios of samples with and without (11), respectively.
SOM-n indicates that the SOM method uses the sgn(·) function
to obtain binary codes rather than using (11). We observe
that using (11) further improves recognition rates and
reduces compression ratios. This indicates that our SOM
methods can correct some binary codes such that the learned
codes become correlated. Since video data often contain a
large number of face samples, it is impossible to make face
samples uncorrelated as assumed by hashing methods.
Reducing the redundancy of video data should be helpful for
performance. We also observe that the improvement from using
(11) is not significant. We regard these results as reasonable
because CNN features have a powerful ability to learn
discriminative representations. Since the binary codes learned
by SOM are already discriminative enough on CNN features,
there is limited potential to further improve performance.
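
For reference, the sgn(·) step used by SOM-n can be sketched as below. This is a minimal illustration under stated assumptions: each bit is taken to be the sign of the inner product between a sample and one ordinal filter, sgn(0) is mapped to +1, and the function name `sgn_codes` and the toy filters are hypothetical. The self-correcting step (11) is not reproduced here.

```python
def sgn_codes(samples, filters):
    """Binarize each sample with sgn(w . x) for every filter w.

    samples: list of feature vectors (sequences of floats).
    filters: list of filter vectors with the same length as each sample.
    Returns one tuple of +1/-1 values per sample.
    """
    codes = []
    for x in samples:
        code = tuple(
            1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1  # sgn(0) := +1
            for w in filters
        )
        codes.append(code)
    return codes


# Three toy filters acting on 2-D features produce 3-bit codes.
filters = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]
samples = [(0.5, -2.0), (-1.0, 3.0)]
print(sgn_codes(samples, filters))
```

Each bit in this sketch depends on a single sample only, so nothing ties the codes of related frames together; the text attributes the additional gain of (11) to correcting some of these codes.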


5.4. Comparisons to binary code methods 

Table 1 and Figures 5, 6, and 7 show recognition rates and
compression ratios of different binary code learning meth¬ 
ods on the three video face databases. From these results, 
we make several observations: 

High-dimensional and dense features are powerful for 
VFR. Three binary feature representation methods (GOM, 
CBFD and DFD) obtain the highest recognition rate (close 
to 100%) on the Honda dataset, and comparable recognition 
rates on the other two datasets. However, the best recog¬ 
nition rates of these three methods are obtained by cosine 
distance rather than Hamming distance. Dense feature rep¬ 
resentations will result in very high computational costs for 
VFR. For the Honda dataset, we can see that longer codes 
will lead to better recognition rates. The recognition rates of 
CCA-ITQ, LSH, FastH, SOM1 and SOM2 increase quickly
as the number of bits increases. 



































































                        Honda                   Mobo                    YouTube
Methods(dim)    RR      CS1     CS2     RR      CS1     CS2     RR      CS1     CS2
GOM(2560)       99.0%   100.0%  100.0%  92.6%   99.7%   100.0%  68.1%   99.3%   99.3%
CBFD(32000)     99.5%   99.4%   100.0%  95.1%   100.0%  100.0%  66.3%   99.3%   99.3%
DFD(50176)      99.2%   100.0%  100.0%  93.6%   100.0%  100.0%  64.7%   99.3%   99.3%

Table 1. Experimental results of three state-of-the-art binary feature representation methods. 'RR', 'CS1' and 'CS2' indicate recognition
rate, compression ratio on the testing set, and compression ratio on the training set respectively.


Compared to the hashing methods designed for image 
retrieval, SOM methods are more effective for VFR. On all 
three databases, SOM methods achieve the highest recogni¬ 
tion rates, and consistently outperform their hashing com¬ 
petitors. This may be because SOM methods can utilize 
and preserve the structure information from face videos. 
Since SOM2 considers discriminative binary codes in its 
prior structure, SOM2 performs better than SOM1 on the
last two databases. On the YouTube database, since CNN 
features capture face variations well, SOM methods obtain 
state-of-the-art recognition rates compared to the complex 
classification models (e.g., image set models). It should be 
noted that the results for these other models are not based on 
CNN features, and their performance should improve if they 
were applied to those features. More importantly, SOM methods
use 64-bit binary features to obtain a better result than
directly using CNN features in a nearest neighbor recognition
framework, which offers an impressive compression of
the 1152-dim CNN features.
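
To give a rough sense of the feature-level saving, the sketch below compares the storage of one 1152-D CNN feature vector with one 64-bit code. The 32-bit float precision is an assumption (the text does not state how the CNN features are stored), and the helper name `storage_bits` is hypothetical. This is a per-feature storage comparison, distinct from the sample compression ratio reported in Figures 6 and 7.

```python
def storage_bits(dim: int, bits_per_value: int) -> int:
    """Bits needed to store one feature vector."""
    return dim * bits_per_value


# Assumption: each of the 1152 CNN feature values is a 32-bit float.
cnn_bits = storage_bits(1152, 32)
# A 64-bit binary code stores one bit per dimension.
code_bits = storage_bits(64, 1)

print(f"CNN feature: {cnn_bits} bits ({cnn_bits // 8} bytes)")
print(f"Binary code: {code_bits} bits ({code_bits // 8} bytes)")
print(f"Reduction factor: {cnn_bits // code_bits}x")
```

Under that assumption a single sample shrinks from 4608 bytes to 8 bytes, a 576-fold reduction per stored feature.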

Binary code learning methods provide a potential way
to reduce the number of registered samples. Since there
are many face samples in a video clip, a lower compression
ratio indicates that an algorithm needs less storage
space and computational time. Since PCA-ITQ and CCA-ITQ
aim to quantize the face samples so that they are
uncorrelated, they should learn different binary codes for
different samples. However, their compression ratios on the
training and testing sets are smaller than 100%. This
indicates that some samples share the same binary code, so
the uncorrelation constraints do not work well. In addition,
the compression ratios of different methods on the training
set seem to be lower than those on the testing set. This
indicates that there are large differences between the
videos in the training and testing sets, so that the learned
coding functions capture the facial variations in the
training set more accurately than those in the testing set.

FastH, SDH, SOM1 and SOM2 obtain lower compression
ratios than other methods, which indicates that
these methods can reduce intra-class variations. On the
Honda and YouTube databases, SDH's performance seems
to benefit mainly from its nonlinear RBF kernel mapping
and anchor points, which force the data to be similar to
the anchor points, resulting in low compression ratios.
Without the nonlinear mapping, SDH-n performs no better than
other methods. Since the nonlinear RBF kernel mapping is
an independent step for SDH, this data mapping can also
be integrated into other methods as a preprocessing step if
applicable. In contrast to SDH, SOM methods employ low-rank
constraints to naturally group data into different clusters
(or anchor points).

The optimal ordinal matrix for classification plays an
important role for SOM. Although SOM1 and SOM2 are
both minimized by Algorithm 1, they perform differently
in terms of recognition rate and compression ratio. This
is because SOM makes use of ordinal matrices as output
structures that are helpful for classification. Different output
structures result in SOMs with different characteristics.
Finding or defining the optimal ordinal matrix is still an
open problem for ordinal measures and hashing. Coding theory
from information theory [16] may provide useful insights
for binary code learning methods.

5.5. Comparisons to VFR methods 

In this subsection, we compare the proposed SOM meth¬ 
ods with prevalent VFR methods that are based on hun¬ 
dreds of floating point features. Fig. 8 (a) plots the average 
recognition rates of different VFR methods on the Honda 
dataset. The interval between two dashed lines indicates 
the improvement in recognition rates (2.6%) if one addi¬ 
tional video sequence is correctly classified. The highest 
recognition rate achieved by SOM is 98.7% at 256 bits. We 
observe that the recognition rates of most of the compared 
methods are between 97.4% and 100%. This indicates that 
there is at most one misclassified sequence in the randomly 
selected subsets. These results also show that we can use 
only binary features and achieve state-of-the-art results on 
the Honda dataset. 

Fig. 8 (b) plots the average recognition rates of different
VFR methods on the CMU Mobo dataset. The interval between
two dashed lines indicates the improvement in recognition
rates (1.4% = 1/72 × 100%) if one additional video
sequence is correctly classified. RNP achieves the highest
recognition rate, 97.4% ± 1.5%. In contrast, the recognition
rate of SOM is 97.1%. This indicates that RNP outperforms
SOM in some random selection cases but not in
others. The reason is probably that SOM simply uses
a nearest neighbor classifier with voting. Since SOM is a
binary feature representation method and RNP is an image
set method, we consider the result of SOM to be comparable
to that of state-of-the-art VFR methods. In addition, an
image set algorithm can also be applied to ordinal features
to further improve accuracy.

Figure 8. Recognition rates of different VFR methods on the three video databases. The interval between two dashed lines indicates the
improvement of recognition rates if one additional video sequence is correctly classified.

Fig. 8 (c) plots the average recognition rates of differ¬ 
ent VFR methods on the Youtube dataset. We observe that 
MSSRC and SOM are the two best methods on this data 
set. Their average recognition rates are 80.8% and 87.0% 
respectively. The accuracy improvement of SOM against 
MSSRC is more than 6%. The high accuracy of MSSRC 
is due to its robust tracker that successfully tracked 92% of 
the videos as compared to the 80% tracked by other meth¬ 
ods. Since the low quality of video frames incurred by the 
high compression rate generates large tracking errors and 
noise in the cropped faces [1], a good tracker should significantly
improve recognition accuracy. However, SOM did
not use any preprocessing techniques (such as histogram 
equalization or an enhanced tracker). These results show 
that using a simple voting classifier can improve over the 
complex VFR models on the fine grained YouTube dataset. 
In addition, SOM can use a 64-bit representation to achieve 
a better recognition result than 1152-D floating point CNN 
representation, which offers an impressive compression ra¬ 
tio over CNN features. 

6. Conclusion 

We introduced the problem of designing data-driven or¬ 
dinal structures for ordinal measures learning, and devel¬ 
oped a structured ordinal measure method for video-based 
face recognition. By reformulating the problem in terms 
of an implied equivalence relation, we posed the learn¬ 
ing problem as a non-convex integer program problem that 
mainly includes two parts. The first part learns stable ordinal
filters to project video data into a large-margin ordinal
space. The second seeks self-correcting and discrete codes
by balancing the projected data and a rank-one ordinal matrix
in a structured low-rank way. Unsupervised and supervised
structures are considered for the ordinal matrix. We
developed an alternating minimization method to efficiently 
minimize the proposed non-convex formulation. Experi¬ 
mental results demonstrate that our SOM methods provide 
state-of-the-art results with fewer features and samples on 
three commonly used video face databases. 

Future work lies in two directions. First, our results
show that the proposed output structures (the optimal ordinal
matrices) are useful for video-based face recognition.
Hence one direction is to design or learn optimal ordinal
matrices based on various facial attributes, which have been
shown to further improve recognition rates. Second, our 
results also show that SOM can efficiently compress redun¬ 
dant samples, resulting in a small set of unique samples. 
During classification, these unique samples can be treated 
as representative samples or anchor points to represent all 
video samples. Hence another potential direction is to ap¬ 
ply the proposed method to the area of representative sam¬ 
ple learning. 